ISSS608 AY2021-22(T2): Take-Home Exercise 3

Create a data visualisation showing average rating and proportion of cocoa percent (% chocolate) greater than or equal to 70% by top 15 company location.

Melissa Tan (SMU MITB Analytics Track)https://scis.smu.edu.sg/master-it-business/analytics-track/curriculum?gclid=CjwKCAiA3L6PBhBvEiwAINlJ9EJwYxpaZv-zPxR0UMntDh37TrlWU7jwXP9Dcu9jvWvN8uEJsOWzTRoCqrQQAvD_BwE
2022-02-21

1. The Task

Create a data visualisation showing average rating and proportion of cocoa percent (% chocolate) greater than or equal to 70% by top 15 company location.

Methods to visualise statistical uncertainty of point estimates by calculating confidence intervals of a mean and proportion were employed.

2. Data Preparation

To address the requirements of the task, chocolate.csv data set was used. The DT package was installed to display an interactive datatable to augment the graph. The crosstalk package was installed to link multiple HTML widgets (e.g. a graph and a datatable) within RMarkdown.

The data preparation was done as follows:

  1. Select the 3 columns of interest, company_location, rating and cocoa_percent from the original datatable. Convert the cocoa_percent column from character to numeric, removing the % symbol. Convert into a decimal for easier manipulation

Code chunk:

choc <- read_csv("data/chocolate.csv")

# Drop the % symbol in cocoa percent column and convert data type to numeric

choc$cocoa_percent<-gsub("%","",as.character(choc$cocoa_percent)) %>%
  as.numeric(choc$cocoa_percent)

# convert cocoa_percent into decimal for easier manipulation

choc$cocoa_percent <- 0.01*choc$cocoa_percent

choc_loc <- choc %>%
  select(`company_location`,`rating`,`cocoa_percent`)
  1. For Average Rating:

Code chunk:

avgR <- choc_loc %>%
  group_by(company_location) %>%
  summarise(nR=n(),
            meanR=mean(`rating`),
            sd = sd(`rating`)) %>%
  mutate(se=sd/sqrt(nR-1)) %>%
  slice_max(`nR`, n=15)

avgR$meanR <- round(avgR$meanR, digits = 2)
avgR$sd <- round(avgR$sd, digits = 2)
avgR$se <- round(avgR$se, digits = 2)
  1. For Cocoa Percentage:

Code chunk:

avgPct <- choc_loc %>%
  filter(choc_loc$cocoa_percent >= 0.7) %>%
  group_by(company_location) %>%
  summarise(nP=n(), meanP = mean(`cocoa_percent`)) %>%
  mutate(seP = sqrt(((`meanP`)*(1-`meanP`))/nP))%>%
  slice_max(`nP`, n=15) %>%
  mutate(meanP100 = meanP*100)

avgPct <- avgPct[,c("company_location", "nP", "meanP100", "seP", "meanP")]

avgPct$meanP100 <- round(avgPct$meanP100, digits = 1)
avgPct$seP <- round(avgPct$seP, digits = 3) 

3. Creating the Interactive Visualisation

To visualise the uncertainties, ggplotly was used with the following customisations:

In addition, a linked data table was created using the crosstalk method to allow users to: - View the full details of the frequency, mean, standard deviation and standard error - Sort any of these columns by clicking on the button at the top of each column

As the two components are linked, selecting any row in the table (e.g. the row with the highest rating or lowest cocoa percentage) will highlight the corresponding element on the confidence interval graph

The code chunk and visualisation for Average Rating are as follows:

#linked charts of Ratings
# Wrap data frame in SharedData

shared_rating = SharedData$new(avgR)

# Render graph
bscols( widths = c(12,12),
  ggplotly((ggplot(shared_rating) +
  geom_errorbar(
    aes(x=reorder(company_location,-meanR,), 
        ymin=meanR-1.96*se,
        ymax=meanR+1.96*se),
    width=0.2, 
    colour="black", 
    alpha=0.9, 
    size=0.5) +
  geom_point(aes
           (x=company_location, 
            y=meanR,
             text = paste("Location:", `company_location`,"<br>N:", `nR`,"<br>Avg. Rating:",`meanR`,"<br>95% CI:[", round((meanR-1.96*se), digits = 2), ",", round((meanR+1.96*se), digits = 2),"]")), 
           stat="identity", 
           color="red",
           size = 1.5,
           alpha=1) +
  xlab("Company Location") +
  ylab("Average Ratings") +
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1))+
  ggtitle("95% Confidence Interval of Average Rating by Top 15 Locations")),
  tooltip = "text"),
  DT::datatable(shared_rating, rownames = FALSE, options = list(pageLength = 5, scrollX=T), colnames = c("No. of Locations", "Average Rating","Std Dev","Std Error"))
  )

The code chunk and visualisation for Cocoa Percentage are as follows:

#linked charts of Cocoa Percent
# Wrap data frame in SharedData

shared_pct = SharedData$new(avgPct)

# Render graph
bscols( widths = c(12,12),
  ggplotly((ggplot(shared_pct) +
  geom_errorbar(
    aes(x=reorder(company_location,-meanP,), 
        ymin=meanP-1.96*seP,
        ymax=meanP+1.96*seP),
    width=0.2, 
    colour="black", 
    alpha=0.9, 
    size=0.5) +
  geom_point(aes
           (x=company_location, 
            y=meanP,
            text = paste("Location:", `company_location`,"<br>N:", `nP`,"<br>Cocoa:",`meanP100`,"%<br>95% CI: [", round((meanP-1.96*seP)*100, digits = 1), "%, ", round((meanP+1.96*seP)*100, digits = 1),"%]")), 
           stat="identity", 
           color="red",
           size = 1.5,
           alpha=1) +
  xlab("Company Location") +
  ylab(" Average Cocoa Percentage (%)") +
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1))+
  ggtitle("95% Confidence Interval of Cocoa Percentage 70% and above 
          by Top 15 Locations")),
  tooltip="text"),
  DT::datatable(shared_pct, rownames = FALSE, options = list(pageLength = 5,scrollX=T, columnDefs = list(list(visible=FALSE, targets=c(4)))),colnames = c("No. of Locations", "Avg Percentage","Pct Std Error","Pct"))
  )

5. Observations and Analysis

  1. U.S.A. was the top company location with 1136 locations for Average Rating and 974 for Cocoa Percentage, followed far behind by Canada (177, 163) and France (176, 130).

  2. The highest chocolate rating was from Australia at 3.36. Australia also clocked the second lowest cocoa percentage of 71.6% in the list.

  3. Ecuador scored the lowest average rating at 3.04, with the highest cocoa percentage at 76.7%,

  4. The lowest cocoa percentage in the top 15 came from Denmark, at 71.1%. Denmark also had the second highest average rating.

  5. The UK had the third highest cocoa percentage at 74.6% and the second lowest average rating at 3.07.

  6. The only Asian country in the top 15 list was Japan, with an average rating of 3.13 and cocoa percentage of 71.7%.

  7. Observations No. 2 to 5 seem to suggest that there may be consumer preferences linking higher ratings with lower cocoa content which warranted further investigation of whether there was any correlation between the two variables.

6. Correlation Analysis

A new datatable made by joining the mean ratings and mean cocoa percentage columns by company_locations. Unmatched rows were not included.

Using the correlation package, the r value of correlation was calculated to be -0.54, with a p-value of 0.048 (lower than the critical value of 0.05 at 95% confidence).

As strong correlations are indicated by r values closer to 1, while weak correlations are indicated by r values closer to 0, the r value of -0.54 in this case indicates a statistically significant low negative correlation between the chocolate’s rating and cocoa percentage.

#scatterplot of correlation

comb <- inner_join(avgR, avgPct, by = "company_location") 

comb <- select(comb, c(`company_location`,`meanR`, `meanP`))

cor_results <- correlation(comb)
cor_results
# Correlation Matrix (pearson-method)

Parameter1 | Parameter2 |     r |         95% CI | t(12) |      p
-----------------------------------------------------------------
meanR      |      meanP | -0.54 | [-0.83, -0.01] | -2.21 | 0.048*

p-value adjustment method: Holm (1979)
Observations: 14

The correlation was further visualised using the ggscatterstats function from the ggstatsplot package and similar results were obtained.

A Note on Confidence Intervals

In addition, error bars communicate the following information about the data:

6. Challenges Encountered

One unexpected challenge encountered in this exercise was that the use of crosstalk would disrupt the default CSS framework of distill. This resulted in the usual text formatting and sizing going haywire.

Upon further research, it was due to a Bootstrap HTML dependency attached to filter_select(), filter_checkbox(), and bscols(). This caused crosstalk to degrade the overall look when used in a non-Bootstrap CSS framework like distill.

RStudio developed a newer version of crosstalk in 2021 and the issue was resolved by installing the latest version of the package


7. References and Resources:

  1. Hands-on Exercise 4: Visual Analytics with R

  2. Pimp my RMD: a few tips for R Markdown

  3. DT: An R interface to the DataTables library

  4. Using Crosstalk

  5. Plotly in R: How to make ggplot2 charts interactive with ggplotly

  6. RStudio/crosstalk

  7. Biology for Life: Interpreting Error Bars